knitr::opts_chunk$set(echo = TRUE)
#install.packages(c("dplyr","ade4","magrittr","cluster","factoextra","cluster.datasets","xtable","kableExtra","knitr","summarytools"))
A distance function, or metric, on \(\mathbb{R}^m,\:m\geq 1\), is a function \(d:\mathbb{R}^m\times\mathbb{R}^m\rightarrow \mathbb{R}\).
A distance function must satisfy three main axioms.
A1. \(d(\mathbf{x},\mathbf{y})= 0\iff \mathbf{x}=\mathbf{y}\) (identity of indiscernibles);
A2. \(d(\mathbf{x},\mathbf{y})= d(\mathbf{y},\mathbf{x})\) (symmetry);
A3. \(d(\mathbf{x},\mathbf{z})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\) (triangle inequality), where \(\mathbf{x}=(x_1,\cdots,x_m)\), \(\mathbf{y}=(y_1,\cdots,y_m)\) and \(\mathbf{z}=(z_1,\cdots,z_m)\) are all vectors of \(\mathbb{R}^m\).
We should use the term dissimilarity rather than distance when not all three axioms A1-A3 hold.
Most of the time, however, we shall use, with some abuse of vocabulary, the term distance.
The Euclidean distance is defined by: \[ d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{j=1}^m (x_j-y_j)^2}. \]
The Manhattan distance is defined by: \[d(\mathbf{x},\mathbf{y}) =\sum_{j=1}^m |x_j-y_j|.\]
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "euclidean")
## x
## y 8.485281
dist(rbind(x, y), method = "euclidean",diag=T,upper=T)
## x y
## x 0.000000 8.485281
## y 8.485281 0.000000
6*sqrt(2)
## [1] 8.485281
dist(rbind(x, y), method = "manhattan")
## x
## y 12
dist(rbind(x, y), method = "manhattan",diag=T,upper=T)
## x y
## x 0 12
## y 12 0
The Canberra distance is defined by: \[d(\mathbf{x},\mathbf{y}) =\sum_{j=1}^m \frac{|x_j-y_j|}{|x_j|+|y_j|}.\]
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "canberra")
## x
## y 2
6/6+6/6
## [1] 2
The Minkowski distance of order \(p\geq 1\) is defined by: \[ d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_p=\left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}. \]
The corresponding norm satisfies: \[\|\mathbf{x}\|_p=d(\mathbf{x},\mathbf{0}),\]
where \(\mathbf{0}\) is the null vector of \(\mathbb{R}^m\).
library("ggplot2")
x = c(0, 0)
y = c(6,6)
MinkowDist=c() # Initialize the list as empty
for (p in seq(1,30,.01))
{
MinkowDist=c(MinkowDist,dist(rbind(x, y), method = "minkowski", p = p))
}
ggplot(data =data.frame(x = seq(1,30,.01), y=MinkowDist ) , mapping = aes( x=x, y= y))+
geom_point(size=.1,color="red")+
xlab("p")+ylab("Minkowski Distance")+ggtitle("Minkowski distance wrt p")
Produce a similar graph using "The Economist" theme. Indicate on the graph the Manhattan and the Euclidean distances, as well as the Chebyshev distance introduced below.
At the limit, we get the Chebyshev distance, which is defined by: \[ d(\mathbf{x},\mathbf{y})=\max_{j=1,\cdots,m}|x_j-y_j|=\lim_{p\rightarrow\infty} \left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}. \]
The corresponding norm is:
\[ \|\mathbf{x}\|_\infty=\max_{j=1,\cdots,m}|x_j|. \]
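In R, the Chebyshev distance is available in dist() via method = "maximum"; for the same two points as above:

```r
x = c(0, 0)
y = c(6,6)
# Chebyshev distance: the largest coordinate-wise absolute difference
dist(rbind(x, y), method = "maximum")
##   x
## y 6
max(abs(x - y))
## [1] 6
```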
The proof of the triangle inequality A3 is based on the Minkowski inequality:
For any nonnegative real numbers \(a_1,\cdots,a_m\); \(b_1,\cdots,b_m\), and for any \(p\geq 1\), we have: \[ \left[\sum_{j=1}^m (a_j+b_j)^{p}\right]^{1/p}\leq \left[\sum_{j=1}^m a_j^{p}\right]^{1/p} +\left[\sum_{j=1}^m b_j^{p}\right]^{1/p}. \]
To prove that the Minkowski distance satisfies A3, notice that \[ \sum_{j=1}^m|x_j-z_j|^{p}= \sum_{j=1}^m|(x_j-y_j)+(y_j-z_j)|^{p}. \]
Since for any reals \(x,y\), we have: \(|x+y|\leq |x|+|y|\), and using the fact that \(x^p\) is increasing in \(x\geq 0\), we obtain: \[ \sum_{j=1}^m|x_j-z_j|^{p}\leq \sum_{j=1}^m(|x_j-y_j|+|y_j-z_j|)^{p}. \]
Applying the Minkowski inequality with \(a_j=|x_j-y_j|\) and \(b_j=|y_j-z_j|\), \(j=1,\cdots,m\), we get: \[ \left(\sum_{j=1}^m|x_j-z_j|^{p}\right)^{1/p}\leq \left(\sum_{j=1}^m |x_j-y_j|^{p}\right)^{1/p}+\left(\sum_{j=1}^m |y_j-z_j|^{p}\right)^{1/p}. \]
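The Minkowski inequality can be checked numerically; a minimal sketch with random nonnegative numbers (the choice \(p=3\) and the sample size are arbitrary):

```r
set.seed(1)
a <- runif(10)  # nonnegative reals a_1, ..., a_10
b <- runif(10)  # nonnegative reals b_1, ..., b_10
p <- 3
lhs <- sum((a + b)^p)^(1/p)
rhs <- sum(a^p)^(1/p) + sum(b^p)^(1/p)
lhs <= rhs
## [1] TRUE
```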
To illustrate the Minkowski inequality, draw \(100\) times two lists of \(100\) draws from the lognormal distribution with mean \(1600\) and standard-deviation \(300\). Illustrate with a graph the gap between the two drawn lists.
# Cauchy-Schwarz inequality
The Pearson correlation coefficient is a similarity measure on \(\mathbb{R}^m\) defined by: \[ \rho(\mathbf{x},\mathbf{y})= \frac{\sum_{j=1}^m (x_j-\bar{\mathbf{x}})(y_j-\bar{\mathbf{y}})}{{\sqrt{\sum_{j=1}^m (x_j-\bar{\mathbf{x}})^2\sum_{j=1}^m (y_j-\bar{\mathbf{y}})^2}}}, \] where \(\bar{\mathbf{x}}\) is the mean of the vector \(\mathbf{x}\), defined by: \[\bar{\mathbf{x}}=\frac{1}{m}\sum_{j=1}^m x_j,\] and similarly for \(\bar{\mathbf{y}}\).
Note that the Pearson correlation coefficient satisfies the symmetry property A2 and is invariant to any positive scaling, i.e.: \[\rho(\alpha\mathbf{x},\mathbf{y})=\rho(\mathbf{x},\mathbf{y}),\] for any \(\alpha>0\).
The Pearson distance (or correlation distance) is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\mathbf{x},\mathbf{y}). \]
Note that the Pearson distance does not satisfy A1, since \(d(\mathbf{x},\mathbf{y})=0\) whenever \(\mathbf{y}=\alpha\mathbf{x}\) with \(\alpha>0\), even if \(\mathbf{y}\neq\mathbf{x}\). Nor does it satisfy the triangle inequality. However, the symmetry property is fulfilled.
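A quick numerical illustration: the Pearson distance between a vector and any positive multiple of it is zero, even though the two vectors differ.

```r
x <- c(3, 1, 4, 15, 92)
# x and 2*x are perfectly correlated, so their Pearson distance is zero
1 - cor(x, 2 * x)  # 0, up to floating-point rounding
```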
x=c(3, 1, 4, 15, 92)
rank(x)
## [1] 2 1 3 4 5
y=c(30,2 , 9, 20, 48)
rank(y)
## [1] 4 1 2 3 5
d=rank(x)-rank(y)
d
## [1] -2 0 1 1 0
cor(rank(x),rank(y))
## [1] 0.7
1-6*sum(d^2)/(5*(5^2-1))
## [1] 0.7
x=c(3, 1, 4, 15, 92)
y=c(30,2 , 9, 20, 48)
tau=0
for (i in 1:5)
{
tau=tau+sign(x -x[i])%*%sign(y -y[i])
}
tau=tau/(5*4)
tau
## [,1]
## [1,] 0.6
cor(x,y, method="kendall")
## [1] 0.6
v=c(3, 1, 4, 15, 92)
w=c(30,2 , 9, 20, 48)
(v-mean(v))/sd(v)
## [1] -0.5134116 -0.5647527 -0.4877410 -0.2053646 1.7712699
scale(v)
## [,1]
## [1,] -0.5134116
## [2,] -0.5647527
## [3,] -0.4877410
## [4,] -0.2053646
## [5,] 1.7712699
## attr(,"scaled:center")
## [1] 23
## attr(,"scaled:scale")
## [1] 38.9551
(w-mean(w))/sd(w)
## [1] 0.45263128 -1.09293895 -0.70654639 -0.09935809 1.44621214
scale(w)
## [,1]
## [1,] 0.45263128
## [2,] -1.09293895
## [3,] -0.70654639
## [4,] -0.09935809
## [5,] 1.44621214
## attr(,"scaled:center")
## [1] 21.8
## attr(,"scaled:scale")
## [1] 18.11629
Consider the following example
Plot the data using a nice scatter plot.
Transform the Height from centimeters (cm) into feet (ft).
Display your data in a table.
Plot the data within a new scatter plot.
What do you observe?
Standardize the two variables Age and Height.
Display your data in a table.
Plot the standardized data within a new scatter plot.
Conclude.
A common simple situation occurs when all the information consists of the presence/absence of 2-level qualitative characters.
We assume there are \(m\) characters.
The presence of a character is coded by \(1\) and its absence by \(0\).
We have at our disposal two vectors:
\(\mathbf{x}\), observed for a first individual (or object);
\(\mathbf{y}\), observed for a second individual.
We can then calculate the following four statistics:
\(a=\mathbf{x\cdot y}=\sum_{j=1}^mx_jy_j.\)
\(b=\mathbf{x\cdot (1-y)}=\sum_{j=1}^mx_j(1-y_j).\)
\(c=\mathbf{(1-x)\cdot y}=\sum_{j=1}^m(1-x_j)y_j.\)
\(d=\mathbf{(1-x)\cdot (1-y)}=\sum_{j=1}^m(1-x_j)(1-y_j).\)
The counts of matches are \(a\) for \((1,1)\) and \(d\) for \((0,0)\);
The counts of mismatches are \(b\) for \((1,0)\) and \(c\) for \((0,1)\).
Note that obviously: \(a+b+c+d= m\).
This gives a very useful \(2 \times 2\) association table.
| First individual | Second individual: 1 | Second individual: 0 | Totals |
|---|---|---|---|
| 1 | \(a\) | \(b\) | \(a+b\) |
| 0 | \(c\) | \(d\) | \(c+d\) |
| Totals | \(a+c\) | \(b+d\) | \(m\) |
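The four counts can be computed directly from the 0/1 vectors; here for the rows of Ilan and Talia from the data introduced below:

```r
x <- c(1,0,1,1,0,0,1,0,0,0)  # Ilan
y <- c(0,0,0,1,0,1,0,0,0,0)  # Talia
a <- sum(x * y)              # matches (1,1)
b <- sum(x * (1 - y))        # mismatches (1,0)
c <- sum((1 - x) * y)        # mismatches (0,1)
d <- sum((1 - x) * (1 - y))  # matches (0,0)
c(a, b, c, d, a + b + c + d)
## [1]  1  3  1  5 10
```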
data=c(
1,0,1,1,0,0,1,0,0,0,
0,1,0,0,1,0,0,0,0,0,
0,0,1,0,0,0,1,0,0,1,
0,1,0,0,0,0,0,1,1,0,
1,1,0,0,1,1,0,1,1,0,
1,1,0,0,1,0,1,1,0,0,
0,0,0,1,0,1,0,0,0,0,
0,0,0,1,0,1,0,0,0,0
)
data=data.frame(matrix(data, nrow=8,byrow=T))
row.names(data)=c("Ilan","Jacqueline","Kim","Lieve","Leon","Peter","Talia","Tina")
names(data)=c("Sex", "Married", "Fair Hair", "Blue Eyes", "Wears Glasses", "Round Face", "Pessimist", "Evening Type", "Is an Only Child", "Left-Handed")
library(knitr)
library(xtable)
library(stargazer)
library(texreg)
library(kableExtra)
library(summarytools)
set.seed(893)
datat<-as.data.frame(t(data))
datat=lapply(datat,as.factor)
Ilan=datat$Ilan
Talia =datat$Talia
print(ctable(Ilan,Talia,prop = 'n',style = "rmarkdown"))
| Ilan / Talia | 0 | 1 | Total |
|---|---|---|---|
| 0 | 5 | 1 | 6 |
| 1 | 3 | 1 | 4 |
| Total | 8 | 2 | 10 |
| Coefficient | \(s(\mathbf{x},\mathbf{y})\) | \(d(\mathbf{x},\mathbf{y})=1-s(\mathbf{x},\mathbf{y})\) |
|---|---|---|
| Simple matching | \(\frac{a+d}{a+b+c+d}\) | \(\frac{b+c}{a+b+c+d}\) |
| Jaccard | \(\frac{a}{a+b+c}\) | \(\frac{b+c}{a+b+c}\) |
| Rogers and Tanimoto (1960) | \(\frac{a+d}{a+2(b+c)+d}\) | \(\frac{2(b+c)}{a+2(b+c)+d}\) |
| Gower and Legendre (1986) | \(\frac{2(a+d)}{2(a+d)+b+c}\) | \(\frac{b+c}{2(a+d)+b+c}\) |
| Gower and Legendre (1986) | \(\frac{2a}{2a+b+c}\) | \(\frac{b+c}{2a+b+c}\) |
To calculate these coefficients, we use the function dist.binary() available in the ade4 package.
All the distances in the ade4 package are of the form \(d(\mathbf{x},\mathbf{y})= \sqrt{1 - s(\mathbf{x},\mathbf{y})}\).
library(ade4)
a=1
b=3
c=1
d=5
dist.binary(data[c("Ilan","Talia"),],method=2)^2
Ilan
Talia 0.4
1-(a+d )/(a+b+c+d)
[1] 0.4
dist.binary(data[c("Ilan","Talia"),],method=1)^2
Ilan
Talia 0.8
1-a/(a+b+c)
[1] 0.8
dist.binary(data[c("Ilan","Talia"),],method=4)^2
Ilan
Talia 0.5714286
1-(a+d )/(a+2*(b+c)+d)
[1] 0.5714286
# One Gower coefficient is missing
dist.binary(data[c("Ilan","Talia"),],method=5)^2
Ilan
Talia 0.6666667
1-2*a/(2*a+b+c)
[1] 0.6666667
From Gan et al.:
Gower's coefficient is a dissimilarity measure specifically designed for handling mixed attribute types or variables.
See: Gower, John C. A general coefficient of similarity and some of its properties. Biometrics, 1971, p. 857-871.
The coefficient is calculated as the weighted average of attribute contributions.
Weights are usually used only to indicate which attribute values can actually be compared meaningfully.
The formula is: \[ d(\mathbf{x},\mathbf{y})=\frac{\sum_{j=1}^m w_j \delta(x_j,y_j)}{\sum_{j=1}^m w_j}. \]
The weight \(w_j\) is set equal to \(1\) when both measurements \(x_j\) and \(y_j\) are nonmissing, and to \(0\) otherwise.
The number \(\delta(x_j,y_j)\) is the contribution of the \(j\)th measure or variable to the dissimilarity measure.
If the \(j\)th measure is nominal, we take
\[
\delta(x_j,y_j)\equiv \begin{cases}0, & \text{if } x_j=y_j;\\ 1, & \text{if } x_j \neq y_j.\end{cases}
\]
If the \(j\)th measure is interval-scaled, we take instead: \[ \delta(x_j,y_j)\equiv \frac{|x_j-y_j|}{R_j}, \] where \(R_j\) is the range of variable \(j\) over the available data.
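These two kinds of contribution can be sketched in a small helper function (the function name gower_delta is illustrative; the daisy() function used below implements the full coefficient):

```r
# Contribution of the jth variable to Gower's dissimilarity (illustrative helper)
gower_delta <- function(x, y, type = c("nominal", "interval"), range = NULL) {
  type <- match.arg(type)
  if (type == "nominal") as.numeric(x != y) else abs(x - y) / range
}
gower_delta("red", "yellow", "nominal")        # mismatch
## [1] 1
gower_delta(150, 25, "interval", range = 180)  # |150 - 25| / 180
## [1] 0.6944444
```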
Consider the following data set:
(Illustrations: a begonia and a broom.)
library(cluster)
library(dplyr)
data <-flower %>%
rename(Winters=V1,Shadow=V2,Tubers=V3,Color=V4,Soil=V5,Preference=V6,Height=V7,Distance=V8) %>%
mutate(Winters=recode(Winters,"1"="Yes","0"="No"),
Shadow=recode(Shadow,"1"="Yes","0"="No"),
Tubers=recode(Tubers,"1"="Yes","0"="No"),
Color=recode(Color,"1"="white", "2"="yellow", "3"= "pink", "4"="red", "5"="blue"),
Soil=recode(Soil,"1"="dry", "2"="normal", "3"= "wet")
)
row.names(data)=c("Begonia","Broom","Camellia","Dahlia","Forget-me-not","Fuchsia",
"Geranium", "Gladiolus","Heather","Hydrangea","Iris","Lily","Lily-of-the-valley",
"Peony","Pink Carnation","Red Rose","Scotch Rose","Tulip")
res=lapply(data,class)
res=as.data.frame(res)
res[1,] %>%
knitr::kable()
| Winters | Shadow | Tubers | Color | Soil | Preference | Height | Distance |
|---|---|---|---|---|---|---|---|
| factor | factor | factor | factor | ordered | ordered | numeric | numeric |
flower[1:2,]
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 0 1 1 4 3 15 25 15
## 2 1 0 0 2 1 3 150 50
max(data$Height)-min(data$Height)
## [1] 180
max(data$Distance)-min(data$Distance)
## [1] 50
\[ \frac{|1-0|+|0-1|+|0-1|+1+|1-3|/2+|3-15|/17+|150-25|/180+|50-15|/50}{8}\approx 0.8875408 \]
Cluster package description available at this link.
library(cluster)
(abs(1-0)+abs(0-1)+abs(0-1)+1+abs(1-3)/2+abs(3-15)/17+abs(150-25)/180+abs(50-15)/50)/8
## [1] 0.8875408
dist<-daisy(data[,1:8],metric = "gower")
as.matrix(dist)[1:2,1:2]
## Begonia Broom
## Begonia 0.0000000 0.8875408
## Broom 0.8875408 0.0000000
stargazer(USArrests,header=TRUE, type='html',summary=FALSE,digits=1)
| Murder | Assault | UrbanPop | Rape | |
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |
| Connecticut | 3.3 | 110 | 77 | 11.1 |
| Delaware | 5.9 | 238 | 72 | 15.8 |
| Florida | 15.4 | 335 | 80 | 31.9 |
| Georgia | 17.4 | 211 | 60 | 25.8 |
| Hawaii | 5.3 | 46 | 83 | 20.2 |
| Idaho | 2.6 | 120 | 54 | 14.2 |
| Illinois | 10.4 | 249 | 83 | 24 |
| Indiana | 7.2 | 113 | 65 | 21 |
| Iowa | 2.2 | 56 | 57 | 11.3 |
| Kansas | 6 | 115 | 66 | 18 |
| Kentucky | 9.7 | 109 | 52 | 16.3 |
| Louisiana | 15.4 | 249 | 66 | 22.2 |
| Maine | 2.1 | 83 | 51 | 7.8 |
| Maryland | 11.3 | 300 | 67 | 27.8 |
| Massachusetts | 4.4 | 149 | 85 | 16.3 |
| Michigan | 12.1 | 255 | 74 | 35.1 |
| Minnesota | 2.7 | 72 | 66 | 14.9 |
| Mississippi | 16.1 | 259 | 44 | 17.1 |
| Missouri | 9 | 178 | 70 | 28.2 |
| Montana | 6 | 109 | 53 | 16.4 |
| Nebraska | 4.3 | 102 | 62 | 16.5 |
| Nevada | 12.2 | 252 | 81 | 46 |
| New Hampshire | 2.1 | 57 | 56 | 9.5 |
| New Jersey | 7.4 | 159 | 89 | 18.8 |
| New Mexico | 11.4 | 285 | 70 | 32.1 |
| New York | 11.1 | 254 | 86 | 26.1 |
| North Carolina | 13 | 337 | 45 | 16.1 |
| North Dakota | 0.8 | 45 | 44 | 7.3 |
| Ohio | 7.3 | 120 | 75 | 21.4 |
| Oklahoma | 6.6 | 151 | 68 | 20 |
| Oregon | 4.9 | 159 | 67 | 29.3 |
| Pennsylvania | 6.3 | 106 | 72 | 14.9 |
| Rhode Island | 3.4 | 174 | 87 | 8.3 |
| South Carolina | 14.4 | 279 | 48 | 22.5 |
| South Dakota | 3.8 | 86 | 45 | 12.8 |
| Tennessee | 13.2 | 188 | 59 | 26.9 |
| Texas | 12.7 | 201 | 80 | 25.5 |
| Utah | 3.2 | 120 | 80 | 22.9 |
| Vermont | 2.2 | 48 | 32 | 11.2 |
| Virginia | 8.5 | 156 | 63 | 20.7 |
| Washington | 4 | 145 | 73 | 26.2 |
| West Virginia | 5.7 | 81 | 39 | 9.3 |
| Wisconsin | 2.6 | 53 | 66 | 10.8 |
| Wyoming | 6.8 | 161 | 60 | 15.6 |
set.seed(123)
ss <- sample(1:50,15)
df <- USArrests[ss, ]
df.scaled <- scale(df)
stargazer(df.scaled,header=TRUE, type='html',summary=FALSE,digits=1)
| Murder | Assault | UrbanPop | Rape | |
| New Mexico | 0.6 | 1.0 | 0.2 | 0.6 |
| Iowa | -1.7 | -1.5 | -0.7 | -1.4 |
| Indiana | -0.5 | -0.9 | -0.1 | -0.5 |
| Arizona | -0.2 | 1.1 | 0.9 | 0.5 |
| Tennessee | 1.0 | -0.1 | -0.5 | 0.1 |
| Texas | 0.9 | 0.1 | 0.9 | -0.04 |
| Oregon | -1.0 | -0.4 | 0.01 | 0.3 |
| West Virginia | -0.8 | -1.3 | -2.0 | -1.6 |
| Missouri | -0.01 | -0.2 | 0.2 | 0.2 |
| Montana | -0.8 | -1.0 | -1.0 | -0.9 |
| Nebraska | -1.2 | -1.0 | -0.3 | -0.9 |
| California | -0.01 | 0.9 | 1.7 | 1.4 |
| South Carolina | 1.3 | 1.0 | -1.3 | -0.3 |
| Nevada | 0.8 | 0.7 | 1.0 | 2.0 |
| Florida | 1.6 | 1.6 | 0.9 | 0.6 |
Remark: All these functions compute distances between rows of the data.
Remark: If we want to compute pairwise distances between variables, we must transpose the data to have variables in the rows.
We first compute Euclidean distances.
dist.eucl <- dist(df.scaled, method = "euclidean",upper = TRUE)
stargazer(as.data.frame(as.matrix(dist.eucl)[1:3, 1:3]),header=TRUE, type='html',summary=FALSE,digits=1)
| New Mexico | Iowa | Indiana | |
| New Mexico | 0 | 4.1 | 2.5 |
| Iowa | 4.1 | 0 | 1.8 |
| Indiana | 2.5 | 1.8 | 0 |
round(sqrt(sum((df.scaled["New Mexico",]-df.scaled["Iowa",])^2)),1)
[1] 4.1
round(sqrt(sum((df.scaled["New Mexico",]-df.scaled["Indiana",])^2)),1)
[1] 2.5
round(sqrt(sum((df.scaled["Iowa",]
-df.scaled["Indiana",])^2)),1)
[1] 1.8
library("factoextra")
dist.cor <- get_dist(df.scaled, method = "pearson")
round(as.matrix(dist.cor)[1:3, 1:3], 1)
## New Mexico Iowa Indiana
## New Mexico 0.0 1.7 2.0
## Iowa 1.7 0.0 0.3
## Indiana 2.0 0.3 0.0
round(1-cor(df.scaled["New Mexico",],df.scaled["Iowa",]),1)
## [1] 1.7
round(1-cor(df.scaled["New Mexico",],df.scaled["Indiana",]),1)
## [1] 2
round(1-cor(df.scaled["Iowa",],df.scaled["Indiana",]),1)
## [1] 0.3
library(factoextra)
fviz_dist(dist.eucl)
What is the number of possible partitions ?
The discordance between the data and a given partition is denoted by \(e\).
We use the technique of local optimization.
A neighborhood of partitions is defined for each partition.
Starting from an initial partition, we search through a set of partitions at each step.
We move from the current partition to a partition in its neighborhood for which \(e\) is minimal.
If the neighborhoods are very large, it is cheaper computationally to move to the first partition discovered in the neighborhood where \(e\) is reduced from its present value.
A number of stopping rules are possible.
For example, the search stops when \(e\) is not reduced by moving to any partition in the neighborhood.
The present partition is locally optimal in that it is the best partition in its neighborhood.
Consider partitions of the five (\(n=5\)) beef foods \(\{\text{BB, BR,BS,BC,BH}\}\) into three clusters (\(k=3\)).
In total, there are 25 such partitions.
A plausible neighborhood for a partition is the set of partitions obtained by transferring an object from one cluster to another.
For the partition (BB BR) (BS) (BC BH), the neighborhood consists of the following ten partitions:
Let \(\mathbf{x}^j\equiv (x_1^j,\cdots,x_n^j)\) denote the vector of values for object \(j\), \(j=1,\cdots ,m\).
The variables are assumed to be scaled.
The partition has \(k\) disjoint clusters \(C_1,\cdots,C_k\), which are the sets of indices of the objects in the various clusters.
Let \(m_l\) be the number of objects in cluster \(C_l\).
Each of the \(m\) objects lies in just one of the \(k\) clusters.
Note that \(\sum_{l=1}^k m_l=m\).
The vector of means over the objects in cluster \(C_l\) is denoted by \(\bar{\mathbf{x}}^{l}\), with \[ \bar{\mathbf{x}}^{l}\equiv\frac{1}{m_l}\sum_{j\in C_l}\mathbf{x}^{j}=(\bar{x}_1^l,\cdots,\bar{x}_n^l),\:l=1,\cdots, k, \] where \[ \bar{x}_i^l\equiv \frac{\sum_{j\in C_l}x_i^{j}}{m_l},\:i=1,\cdots, n; \:l=1,\cdots,k. \]
The distance between the object \(j\) and the cluster \(l\) is \(d(\mathbf{x}^j,\bar{\mathbf{x}}^l)\), where \(d\) is taken to be the Euclidean distance \[d(\mathbf{x}^j,\bar{\mathbf{x}}^l)=||\mathbf{x}^j-\bar{\mathbf{x}}^l||=\bigg[\sum_{i=1}^n(x_i^j-\bar{x}_i^l)^2\bigg]^{1/2},\:j=1,\cdots,m;\:l=1,\cdots,k,\] where \(||\mathbf{\cdot}||\) is the Euclidean norm.
The error of the partition is taken to be
\[
e=\sum_{l=1}^{k}\sum_{j\in C_l}d^2(\mathbf{x}^j,\bar{\mathbf{x}}^l)=
\sum_{l=1}^{k}\sum_{j\in C_l}
||\mathbf{x}^j-\bar{\mathbf{x}}^l||^2.
\]
Alternatively, we have
\[ e=\sum_{j=1}^{m}d^2(\mathbf{x}^j,\bar{\mathbf{x}}^{l(j)})=\sum_{j=1}^{m}\sum_{i=1}^n(x_i^j-\bar{x}_i^{l(j)})^2, \]
where \(l(j)\) is the index of the cluster of object \(j\).
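The error \(e\) can be computed directly in R; a minimal sketch (the function name and the three-object example matrix are illustrative, not from the text):

```r
# Total within-cluster sum of squared Euclidean distances to cluster means
partition_error <- function(X, cl) {
  e <- 0
  for (l in unique(cl)) {
    rows <- X[cl == l, , drop = FALSE]
    center <- colMeans(rows)               # cluster mean vector
    e <- e + sum(sweep(rows, 2, center)^2) # squared deviations from the mean
  }
  e
}
X <- rbind(c(11, 29, 1), c(8, 30, 1), c(13, 21, 1))
partition_error(X, c(1, 1, 2))  # clusters {1,2} and {3}
## [1] 5
```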
\[ \Delta e= \frac{m_ld^2(\mathbf{x}^1,\bar{\mathbf{x}}^l)}{m_l+1}-\frac{m_{l(1)}d^2(\mathbf{x}^1,\bar{\mathbf{x}}^{l(1)})}{m_{l(1)}-1},\:l=1,\cdots, k,\:l\neq l(1). \]
Prove that the error variation is indeed given by:
\[ \Delta e= \frac{m_ld^2(\mathbf{x}^1,\bar{\mathbf{x}}^l)}{m_l+1}-\frac{m_{l(1)}d^2(\mathbf{x}^1,\bar{\mathbf{x}}^{l(1)})}{m_{l(1)}-1},\:l=1,\cdots, k,\:l\neq l(1). \]

# K-MEANS APPLIED TO FOOD NUTRIENT DATA

* Only the first eight foods will be considered.
* Only three variables are used: food energy, protein, and calcium, as percentages of recommended daily allowances.
* The eight foods \((m=8)\) are partitioned into three clusters (\(k=3\)).
#library("cluster.datasets")
#write.csv(rda.meat.fish.fowl.1959,"Hartigandata1.csv")
df<-read.csv("Hartigandata1.csv")
print(df)
## X name energy protein fat calcium iron
## 1 1 Braised beef 11 29 28 1 26
## 2 2 Hamburger 8 30 17 1 27
## 3 3 Roast beef 18 21 39 1 20
## 4 4 Beefsteak 12 27 32 1 26
## 5 5 Canned beef 6 31 10 2 37
## 6 6 Broiled chicken 8 29 3 1 14
## 7 7 Canned chicken 5 36 7 2 15
## 8 8 Beef heart 5 37 5 2 59
## 9 9 Roast lamb leg 8 29 20 1 26
## 10 10 Roast lamb shoulder 9 26 25 1 23
## 11 11 Smoked ham 11 29 28 1 25
## 12 12 Pork roast 11 27 29 1 25
## 13 13 Pork simmered 11 27 30 1 24
## 14 14 Beef tongue 6 26 14 1 25
## 15 15 Veal cutlet 6 33 9 1 27
## 16 16 Baked bluefish 4 31 4 3 6
## 17 17 Raw clams 2 16 1 10 60
## 18 18 Canned clams 1 10 1 9 54
## 19 19 Canned crabmeat 3 20 2 5 8
## 20 20 Fried haddock 4 23 5 2 5
## 21 21 Broiled mackerel 6 27 13 1 10
## 22 22 Canned mackerel 5 23 9 20 18
## 23 23 Fried perch 6 23 11 2 13
## 24 24 Canned salmon 4 24 5 20 7
## 25 25 Canned sardines 6 31 9 46 25
## 26 26 Canned tuna 5 36 7 1 12
## 27 27 Canned shrimp 3 33 1 12 26
df<-df[1:8,c(3,4,6)]
df
## energy protein calcium
## 1 11 29 1
## 2 8 30 1
## 3 18 21 1
## 4 12 27 1
## 5 6 31 2
## 6 8 29 1
## 7 5 36 2
## 8 5 37 2
# The data contain some errors
df[3,1]<-13 # Error in line 3
df[6,1]<-4 # Error at line 6
df[7,3]<-1 # Error at line 7
df
## energy protein calcium
## 1 11 29 1
## 2 8 30 1
## 3 13 21 1
## 4 12 27 1
## 5 6 31 2
## 6 4 29 1
## 7 5 36 1
## 8 5 37 2
rownames(df)<-c("BB","HR","BR","BS","BC","CB","CC","BH")
df
## energy protein calcium
## BB 11 29 1
## HR 8 30 1
## BR 13 21 1
## BS 12 27 1
## BC 6 31 2
## CB 4 29 1
## CC 5 36 1
## BH 5 37 2
colnames(df)<-c("Energy","Protein","Calcium")
df
## Energy Protein Calcium
## BB 11 29 1
## HR 8 30 1
## BR 13 21 1
## BS 12 27 1
## BC 6 31 2
## CB 4 29 1
## CC 5 36 1
## BH 5 37 2
km.res<-kmeans(df,3)
km.res
## K-means clustering with 3 clusters of sizes 4, 1, 3
##
## Cluster means:
## Energy Protein Calcium
## 1 5.00000 33.25000 1.5
## 2 13.00000 21.00000 1.0
## 3 10.33333 28.66667 1.0
##
## Clustering vector:
## BB HR BR BS BC CB CC BH
## 3 3 2 3 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 47.75000 0.00000 13.33333
## (between_SS / total_SS = 77.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Remark: The location of a knee (elbow) in the plot of the total within-cluster sum of squares against the number of clusters is generally taken to indicate an appropriate number of clusters.
km.res$cluster # cluster number
## BB HR BR BS BC CB CC BH
## 3 3 2 3 1 1 1 1
km.res$size # Cluster size
## [1] 4 1 3
km.res$centers # Cluster means
## Energy Protein Calcium
## 1 5.00000 33.25000 1.5
## 2 13.00000 21.00000 1.0
## 3 10.33333 28.66667 1.0
library(ggplot2)
library(factoextra)
fviz_cluster(km.res, df, ellipse.type = "norm")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
fviz_cluster(km.res, data = df,
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
The average dissimilarity for the case where objects 4 and 8 are selected is 2.30, which is considerably less than the value of 9.37 found when objects 1 and 5 were the representative objects.
Alternatively, a PAM program can be used by entering a matrix of dissimilarities between objects.
This process is continued until \(k\) objects have been found.
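The PAM algorithm is implemented as pam() in the cluster package, which accepts either raw data or a dissimilarity object; a minimal sketch on the eight-food data used above (values retyped here so the block is self-contained):

```r
library(cluster)
df <- data.frame(
  Energy  = c(11, 8, 13, 12, 6, 4, 5, 5),
  Protein = c(29, 30, 21, 27, 31, 29, 36, 37),
  Calcium = c(1, 1, 1, 1, 2, 1, 1, 2),
  row.names = c("BB", "HR", "BR", "BS", "BC", "CC", "CB", "BH")
)
pam.res <- pam(df, k = 3)    # PAM from the raw data
pam.res$medoids              # the k representative objects (medoids)
pam(dist(df), k = 3)$id.med  # PAM from a dissimilarity matrix instead
```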